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We propose dimension reduction methods for sparse, high di- 
mensional muhivariate response regression models. Both the number 
of responses and that of the predictors may exceed the sample size. 
Sometimes viewed as complementary, predictor selection and rank 
reduction are the most popular strategies for obtaining lower dimen- 
sional approximations of the parameter matrix in such models. We 
show in this article that important gains in prediction accuracy can 
be obtained by considering them jointly. We motivate a new class of 
sparse multivariate regression models, in which the coefficient matrix 
has low rank and zero rows or can be well approximated by such a 
matrix. Next, we introduce estimators that are based on penalized 
least squares, with novel penalties that impose simultaneous row and 
rank restrictions on the coefficient matrix. We prove that these esti- 
mators indeed adapt to the unknown matrix sparsity and have fast 
rates of convergence. We support our theoretical results with an ex- 
tensive simulation study and two data analyses. 

1. Introduction. The multivariate response regression model 
(1) Y = XA + E 

postulates a linear relationship between Y, the m x n matrix containing 
measurements on n responses for m subjects and X, the m x p matrix 
of measurements on p predictor variables, of rank q. The term E is an 
unobserved mxn matrix with independent, A^(0,(T^) entries. The unknown 
p X n coefficient matrix A of unknown rank r needs to be estimated. If we 
use (1) to model complex data sets, with a high number of responses and 
predictors, the number of unknowns can quickly exceed the sample size m, 
but the situation need not be hopeless for the following reason. Let r denote 
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the rank of A and J denote the index set of the non-zero rows of A and \J\ 
its cardinahty. Counting the parameters in the singular value decomposition 
of A, we observe that in fact only r(n + \ J\ — r) free parameters need to 
be estimated, and this can be substantially lower than the sample size m. 
Furthermore, as we can always reduce X of rank q to an mx q matrix with 
q independent columns in that span the same space as the columns of 
X, we can always assume that \ J\ < q. If A is of full rank with no zero rows 
then the total number of parameters to be estimated reverts back to nq. 
If either, or both, q and n are large, more parsimonious models have to be 
proposed. Among the possible choices, two are particularly popular. 

The first class consists of rank sparse or rank deficient models, which 
postulate either that A has low rank or that it can be well approximated by 
a low rank matrix. Methods tailored to rank sparsity seek adaptive rank k 
approximations of the coefficient matrix A. Then, one only needs to estimate 
k{q-\-n—k) parameters, which can be substantially less than nq for low values 
oik. 

The second class of models reflects the belief that \J\ is smaller than q, 
and we will call them row sparse models. Methods that adapt to row sparsity 
belong to the variable selection class, as explained in Section 1.1 below. The 
effective number of parameters of such models is \J\n. This number is smaller 
than the unrestricted nq, but may be higher than r(| J| + n — r), especially 
if the rank of A is low. 

This discussion underlines the need for introducing and studying another 
class of models, that embodies both sparsity constraints on A simultane- 
ously. In this work we introduce row and rank sparse models, and suggest 
and analyze new methods that combine the strengths of the existing di- 
mension reduction techniques. We propose penalized least squares methods, 
with new penalties tailored to adaptive and optimal estimation in the row 
and rank sparse model (1). The rest of the article is organized as follows. 

We introduce in Section 2.1 a product-type penalty that imposes simulta- 
neously rank and row sparsity restrictions on the coefficient matrix. It gener- 
alizes both AlC-type penalties developed for variable selection in univariate 
response regression models as well as the rank penalty of Bunea, She and Wegkamp 
(2011) for low rank estimation in multivariate response models. The purpose 
of the resulting method is two-fold. Firstly, we prove in Theorem 1 of Sec- 
tion 2.1 that the resulting estimators of A adapt to both types of sparsity, 
row and rank, under no conditions on the design matrix. Their rates of con- 
vergence coincide with the existing minimax rates in the literature, up to a 
logarithmic term, cf. Koltchinskii, Lounici and Tsybakov (2011). Secondly, 
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we show in Theorem 2 that this method can also be employed for selecting 
among competing estimators from a large finite list. This is of particular in- 
terest for selecting among estimates of different ranks and sparsity patterns, 
possibly obtained via different methods. The results of Section 2.1 hold for 
any values of m, n and p and, in particular, both n and p can grow with m, 
but computing the estimator analyzed in Theorem 1 requires an exhaustive 
search over the class of all possible models, the size of which is exponential 
in p, and this becomes computationally prohibitive if p > 20. 

To address the computational issue, we propose two other methods in Sec- 
tion 2.2. The crucial ingredient of both methods is the selection of predictors 
in multivariate response regression models under rank restrictions. We define 
and analyze this core procedure in Section 2.2, and describe a computation- 
ally efficient algorithm in Section 3.1. By combining this method with two 
different ways of selecting the rank adaptively we obtain two estimators of 
A. Both are computable in high dimensions, and both achieve the rates 
discussed in Section 2.1, up to a log(p) factor, under different, mild assump- 
tions. We also compare the theoretical advantages of these new methods over 
a simple two stage procedure in which one first selects the predictors, and 
then reduces the rank. We illustrate the practical differences via a simulation 
study in Section 3.2. We then use our methods for the analysis, presented in 
Section 4, of two data sets arising in machine learning and cognitive neuro- 
science, respectively. The proofs of our results are collected in the Appendix. 

Background. Before we discuss our methods, we give an overview of ex- 
isting procedures of adaptive estimation in (1), that adapt to either rank 
or row sparsity, but not both. We also present a comparison of target rates 
under various sparsity assumptions on the coefficient matrix A in model (1). 

Reduced rank estimation of A in (1) and the immediate extensions to 
principal components analysis (PCA) and canonical correlation analysis 
(CCA) are perhaps the most popular ways of achieving dimension reduc- 
tion of multivariate data. They have become a standard tool in time series 
(Brillinger, 1981), econometrics (Reinsel and Velu, 1998) and machine learn- 
ing (Izenman, 2008), to name just a few areas. The literature on low rank 
regression estimation of A dates back to Anderson (1951). The model is 
known as reduced-rank regression (RRR) (Izenman, 2008) and, until re- 
cently, it had only been studied theoretically from an asymptotic perspec- 
tive, in a large sample size regime. We refer to Reinsel and Velu (1998) for 
a historical development and references, and to Izenman (2008) for a large 
number of applications and extensions. Very recently, a number of works 
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proposed penalized least squares estimators. For penalties proportional to 
the nuclear norm, we refer to Yuan et al. (2007), Candes and Plan (2010), 
Negahban and Wainwright (2011), Rohde and Tsybakov (2011). For penal- 
ties proportional to the rank, we refer to Bunea, She and Wegkamp (2011) 
and Giraud (2011). Both types of estimators are computationally efficient, 
even if max(n,p) > m, and both achieve, adaptively, the rate of convergence 
{q + n)r which, under suitable regularity conditions is the optimal minimax 
rate in (1) under rank sparsity, see, e.g., Rohde and Tsybakov (2011) for 
lower bound calculations. 

To explain the other notion of sparsity, note that removing predictor Xj 
from model (1) is equivalent with setting the j'th row in A to zero. Since 
vectorizing both sides of model (1) yields a univariate response regression 
model, we can view the rows of A as groups of coefficients in the transformed 
model. We can set them to zero by any group selection method developed for 
univariate response regression models in high dimensions such as the Group 
Lasso (Yuan and Lin, 2006), GLASSO for later reference. The optimal min- 
imax rate in (1) under row sparsity is proportional to \ J\n + \ J\ log(p/| J|), 
again under suitable regularity conditions, see Lounici et al. (2011) and 
Wei and Huang (2010). 

Despite these very recent advances, adaptive low rank estimation in (1), 
based on a reduced set of predictors, has not been investigated either theo- 
retically or practically. For ease of reference, Table 1 below contains a rate 
comparison between optimal prediction error rates achievable by variable se- 
lection (GLASSO), low rank estimation (RSC and NNP) and our new joint 
rank and row selection (JRRS) methods, respectively. 

Table 1 

Oracle rate comparison between JRRS, GLASSO, and RSC. The sample size and 
dimension parameters m,p, n, q, r, \J\ satisfy q < m A p, r < n A \ J\, \ J\ < q. 

GLASSO: n|J| + |J|log(p) 
RSC or NNP: nr + qr 

JRRS: nr + \J\r\og{p/\J\) 

The table reveals that if n > g, the rates of the RSC, NNP and JRRS are 
dominated by nr, regardless of J, while if n < g, the new class of methods 
can provide substantial rate improvements over the existing methods, espe- 
cially when the rank r is low. 

2. Adaptation to rovf and rank sparsity: estimation procedures 
and oracle inequalities. 
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2.1. The single-stage joint rank and row selection estimator. In this sec- 
tion we modify the rank selection criterion (RSC) introduced in Bunea, She and Wegkamp 
(2011) to accommodate variable selection. We propose our single-stage Joint 
Rank and Row Selection (JRRS) estimator 

(2) B = argminsllly - XB\\l + pen(S)}, 
also denoted by JRRSl, with penalty term 

(3) pen(S) = ca\iB) !^2n + log(2e)| J(i?)| + | J(i?)| log(^7^) 

The penalty is essentially proportional to the number of parameters in a 
model with fewer predictors J{B) and of reduced rank r{B). Here c > 3 
is a numerical constant, J{B) is the set of indices of non-zero rows, r{B) 
is the rank of a generic p x n matrix B and the squared Frobenius norm 
of a generic matrix M is denoted by ||M|||;. and is equal to the sum of the 
squared entries of M. 

If B is computed by minimizing over all p x n matrices B, then Theorem 
1 stated below shows that it adapts optimally to the unknown row and rank 
sparsity of A: the mean squared error of B coincides with that of optimal 
estimators of rank r and with |J| non-zero rows, had these values been 
known prior to estimation. However, the construction of B does not utilize 
knowledge of either r or J, hence the term adaptive. The minimax lower 
bounds for this model can be obtained by an immediate modification of 
Theorem 5 in Koltchinskii, Lounici and Tsybakov (2011). Our single-stage 
JRRS estimator B given in (2) above achieves the lower bound, up to a log 
factor, under no restrictions on the design X, rank r, or dimensions m,n,p. 

Theorem 1. The single-stage JRRS estimator B in (2) using pen{B) 
in (3) with c = 12^ satisfies 



E 



XA-XB\\l < W\\XA-XBfp + 8pen{B) + 768na^exp{-n/2) 



for any B with r[B) > 1. In particular, if r[A) > 1, 



E 



XA-XB\\l < aV(A) n + |J(A)|log 



'\JiA)l 



Here and elsewhere < means that the inequality holds up to multiplicative 
numerical constants. 



^Our proof shows that we may take c > 3, at the cost of increasing numerical constants 
in the right hand side of the oracle inequality. 
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The proof of Theorem 1 remains vahd if the matrices we select from 
depend on the data. Thus our procedure can be used for selecting from 
any countable list of random matrices of different ranks and with different 
sparsity patterns. We will make essential use of this fact in the next section. 

Theorem 2. For any collection of (random) non-zero matrices Bi, B2, ■ ■ ■ , 
the single-stage JRRS estimator 

(4) B = arg min^^, \\Y - XBj \\p + pen{Bj ) 

with c = 12 satisfies, 



E 



\XA-XB\\l\ < inf {WE[\\X A - XBjfp + 8E[pen(Bj)]} 
+768710-2 exp(-n/2). 



2.2. Two-step joint rank and row selection estimators. The computa- 
tional complexity of the single-stage JRRS estimator (2) is owed to the 
component of the penalty term proportional to J{B), which is responsible 
for row selection. The existence of this term in (3) forces complete enumer- 
ation of the model space. We address this problem by proposing a convex 
relaxation ||-B||2,i of this component. Here ||-B||2,i = Z^j=i ll^jlb is the sum 
of the Euclidean norms \\bj\\2 of the rows bj of S. In this section we propose 
two alternatives, each a two-step JRRS procedure and each building on the 
following core estimator. 



2.2.1. Rank- constrained predictor selection. We define our rank-constrained 
row-sparse estimators 5^ of A as 

(5) Bk = argnmv(5)<fc {||y - XBfp + 2A||5||2,i} . 

Here A is a tuning parameter and the minimization is over all p x n matrices 
B of rank less than or equal to (a fixed) k. A computationally efficient 
numerical algorithm for solving this minimization problem is given in Section 
3.1. Clearly, for k = q, there is no rank restriction in (5) and the resulting 
estimator is the GLASSO estimator; for A = 0, we obtain the reduced-rank 
regression estimator. Thus, the procedure yielding the estimators Bk of rank 
k acts as a synthesis of the two dimension reduction strategies, having each 
of them as limiting points. We will refer to B^ as the Rank Constrained 
Group Lasso (RCGL) estimators. 
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Since this estimator is central to our procedures, we analyze it first. We 
need the following mild assumption on S = X' X/m. 

Assumption 21. We say S G W^^ satisfies condition 21(7, 5i) for an index 
set / C {1, . . . ,p} and positive number 5/, iff 



for all pxn matrices M (with rows mj) satisfying ^^^^ ||mj||2 > ^^^^^c ||w-j||2. 
Remarks. 

1. The constant 2 may be replaced by any constant larger than 1. 

2. Assumption 21 allows designs with p > m, and can be seen as a version 
of the restricted eigenvalue condition in the variable selection litera- 
ture introduced in Bickel, Ritov and Tsybakov (2008) and analyzed in 
depth in Biihlmann and van de Geer (2011). 

3. A sufficient condition for (6) is : There exists a diagonal matrix D 
with Djj = 6i for all j & I and Djj = otherwise such that T, — D is 
positive definite. 

Let Ai(S) denote the largest eigen- value of S and set the tuning parameter 



for some numerical constant C > 0. Notice that A depends on k, but we 
suppress this dependence in our notation. 

Theorem 3. Let Bj. he the global minimizer of (5) corresponding to A 
in (7) with C large enough. Then, we have 



for any p x n matrix B with 1 < r{B) < k, \J{B)\ non-zero rows, provided 
S satisfies Assumption 2l(J(S), (5j(5)). 

Remarks. 

1. The term {n + | J(-B)| log{p)}ka'^ in (8) is multiplied by a factor k, = 
Ai (S)/ 5) . This factor can be viewed as a generalized condition num- 
ber of the matrix S. If k stays bounded, Theorem 3 shows that, within 



(6) 




(7) 



A = C7c7v/Ai(5])A;mlog(ep) 



(8) 



E \\XBk-XAfp 
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the class of row sparse matrices of fixed rank k, the RCGL estimator 
is row-sparsity adaptive, in that the best number of predictors does 
not have to be specified prior to estimation. 
2. It is interesting to contrast our estimator with the regular GLASSO 
estimator A that minimizes ||y — + 2A'||i?||2,i over all p x n 

matrices B. Our choice (7) of the tuning parameter A markedly dif- 
fers from the choice proposed by Lounici et al. (2011) for the GLASSO 
estimator A. We need a different choice for A and a more refined analy- 
sis since we minimize in (5) over all p x n matrices B of rank r{B) < k. 

2.2.2. Adaptive rank- constrained predictor selection. We now develop the- 
oretical properties of three methods, Method 1 (RSC— >-RCGL), Method 2 
(RCGL^JRRSl), and Method 3 (GLASSO^RSC). JRRSl denotes the 
single-stage JRRS estimator of Section 2.1. 

Theorem 3 suggests that by complementing RCGL by a method that esti- 
mates the rank consistently, we could obtain row and rank optimal adaptive 
estimator. This is indeed true. 

Method 1 (RSC^RCGL) 

• Use the Rank Selection Criterion (RSC) of Bunea, She and Wegkamp 
(2011) to select k = r as the number of singular values of PY that 
exceed cr(\/2n -|- ^/2q). Here P is the projection matrix on the space 
spanned by X. 

• Compute the rank constrained GLASSO estimator Bk in (5) above 
with A; = r to obtain the final estimator B^'^^ = Sf. 

This two-step estimator adapts to both rank and row sparsity, under two 
additional, mild restrictions. 

Assumption Ci. dr{XA) > 2\/2a{y/n^ ^fq) 
Assumption (Ti. log(||X^||i;^) < (\/2- l)2(n + g)/4. 

Assumption only requires that the signal strength, measured by dr[XA)^ 
the r-th singular value of the matrix XA^ be larger than the "noise level" 
loyjln + 2(/, otherwise its detection would become problematic. The tight- 
ness of Ci is discussed in detail in Bunea, She and Wegkamp (2011). Theo- 
rem 2 of that work proves that the correct rank will be selected with prob- 
ability^ 1 - exp{-co(n + q)] with cq = (\/2 - 1)^/2. 

^Hence, if ti + g is small compared to m, we suggest to replace q by glog(m) in the 
threshold level in the definition of r, in £i, and in £2, see the remark following Corollary 
4 in Bunea, She and Wegkamp (2011). 
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Assumption <tz is technical and needed to guarantee that the error due 
to selecting the rank is negligible compared to the rate nr + | J|r log(p). 

Theorem 4. Let S satisfy 2t(J, 8j) with J = J {A) / 0, let Ai(S)/5j he 
hounded, and let Ci and hold. Then the two-step JRRS estimator B^^^ 
with A set according to (7) with C large enough, satisfies 



E 



iXfiW -XAWl 



F 



< nr + \ J\r log(p). 



Hence 1?^^^ is row and rank adaptive, and achieves the same optimal rate, 
up to a log(p) factor, as the row and rank adaptive B studied in Theorem 
1 above. While Theorem 1 is proved under no restrictions on the design, 
we view the mild conditions of Theorem 4 as a small price to pay for the 
computational efficiency of B^^^ relative to that of B in (2). The practical 
choice of the threshold 2a^/2r^+2q in the initial step of our procedure can 
be done either by replacing cr^ by an estimator, as suggested and analyzed 
theoretically in Section 2.4 of Bunea, She and Wegkamp (2011), or by cross- 
validation. The latter is valid in this context for consistent rank selection, as 
the minimum squared error of rank restricted estimators in (1) is achieved 
for the true rank, as discussed in detail in Bunea, She and Wegkamp (2011). 

We now present an alternative adaptive method that is more computation- 
ally involved than method 1, as it involves a search over a two-dimensional 
grid, but its analysis does not require and d. 

Method 2 (RCGL^JRRSl) 

• Pre-specify a grid A of values for A and use (5) to construct the class 
B = {Bk,x--^l<k<q, XeA}. 

• Compute = aTgmin^^jg{\\Y - XB\\p + pen(S)}, with pen(i?) 
defined in (3) above. 

We have the same conclusion as for method 1: 

Theorem 5. Provided S satisfies condition 2l( J, 5j) with J = J {A) ^ 0, 
Ai(S)/(5j is hounded, and A contains A in (7) for some C large enough, we 
have 



E 



< nr +\J\\og{p)r. 



We see that B^'^'> has the same rate as B^'^^ under condition 2t on the 
design only. Our simulation studies in Section 4 indicate that the numerical 



10 



results of method 1 and method 2 are comparable. 

Remark. A perhaps more canonical two-stage procedure is as follows: 
Method 3 (GLASSO^RSC) 

• Select the predictors via the GLASSO 

• Use the Rank Selection Criterion (RSC) of Bunea, She and Wegkamp 
(2011) to construct an adaptive estimator, of reduced rank, based only 
on the selected predictors. 

It is clear that as soon as we have selected the predictors consistently in the 
first step, selecting consistently the rank in the second step and then proving 
row and rank sparsity of the resulting estimator will follow straightforward 
from existing results, for instance Theorem 7 in Bunea, She and Wegkamp 
(2011). Although this is a natural path to follow, there is an important caveat 
to consider: the sufficient conditions under which this two-step process yields 
adaptive (to row and rank sparsity) estimators include the conditions under 
which the GLASSO yields consistent group selection. These conditions are 
in the spirit of those given in Bunea (2008), for the Lasso, and involve 
the mutual coherence condition on S = (Sjj)i<jj<p, which postulates that 
the off-diagonal elements of S be small. Specifically, for the GLASSO, the 
restriction becomes < l/(7a|J|), for some a > 1, cf. Lounici et al. 

(2011), if it is coupled with the condition that n~^/'^\\aj II > Cim-V2[i + 
6*2^"^/^ logp]^/'^. Here ||aj|| is the Euclidean norm of the jth row vector of 
j4, and Ci and C2 are constants. For designs for which S is even closer to the 
identity matrix, in that |Sjj| < l/(14a| J|n), the condition on the minimum 
size of detectable coefficients can be relaxed to n~^/^||aj 
6*2^"^ logp]^/^, see Corollary 5.2 in Lounici et al. (2011). Our Theorems 4 
and 5 require substantially weaker assumptions on the design. 

3. Computational issues and numerical performance compari- 
son. 

3.1. A computational algorithm for the RCGL-estimator. In this sec- 
tion, we design an algorithm for minimizing 

(9) F{B-\):=\\\Y -XB\\l + \\\B\\2,i 

over all p X n matrices B of rank less than or equal to k. Recall that by solving 
this problem we provide a way of performing rank- constrained variable selec- 
tion in model (1). Directly solving the nonconvex constrained minimization 
problem for B in (9) may be difficult. One way of surmounting this difficulty 
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is to write B = SV\ with V being orthogonal. Then the rank constrained 
group lasso (RCGL) optimization problem is equivalent to finding 

(10) (S,V) = argmin5giRpxfeyg:0"x''^ll^ - + A||5||2,i, 

where the minimum is taken over all orthogonal n x k matrices V and all 
p X k matrices S. With a slight abuse of notation, we still denote the ob- 
jective function in (10) by F{S,V; X). We propose the following iterative 
optimization procedure. 



Algorithm £/ Rank Constrained Group Lasso (RCGL) Computation 

given l<fc<mApAn, A>0, Vj^"^ G 0"^*^ (say the first k columns of J„xn or the 
first k right singular vectors of Y) 
j 0, converged ^ FALSE 
while not converged do 

(a) Si]t'^ ^^rgmmsmpy^^l\\YV^^l~XS\\% + \\\Sh,,. 

(b) Let W <- F'XS'^-^+'' e R"^* and perform SVD: W = U^,D^V^. 

(c) Vl^^'^ ^U^Vl, ' 

(d) Bi^r^^s(^r^(v;'r')' 

(e) converged ^ \F{Bi't'^;X) - F{Bi'{;X)\ < e 

(f ) 3 ^ J + 1 
end while 

deliver B,,, = B'^^"' ^ S,., = 5^r^ V,,, = V^^^'\ 



The following theorem presents a global convergence analysis for Algo- 
rithm £/, where global in this context refers to the fact that the algorithm 
converges for any initial point. 

Theorem 6. Given A > and an arbitrary starting point V^'^^ G O"^'^, 

let (iS'^'^l, V^''l) (j = 1,2, - ■■ ) be the sequence of iterates generated by Algo- 
rithm £/ . The following two statements hold: 

(i) Any accumulation point of {S^l^,V^'^^) is a stationary point of F and 

^(^kX^^k''x) converges monotonically to F^S'l )^,V^ ^) for some sta- 
tionary point {SI ^, x)- 
(a) Suppose for any (S, V) outside the local minimum set of F, F{S, V) > 
min g I, F{S,V). Then, any accumulation point of {S^\,V^''l) is 

a local minimum of F and F{sl^\,V^''l) converges monotonically to 
^i^k A' ^kx) some local minimizer (5^ ^, x)- 
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Remarks. 

1. We run the algorithm to obtain a solution path, for each (A;, A) in a 
two-dimensional grid, or for a grid of A with k determined by RSC. 
From the solution path, we get a series of candidate estimates. Then 
the single stage JRRS (4) or other tuning criteria can be used to select 
the optimal estimate. 

2. Our results are of the same type as those established for the conver- 
gence of the EM algorithm (Wu, 1983). £/ can be viewed as a block co- 
ordinate descent method, but the conclusion in Theorem 6 is stronger 
in some sense: the guaranteed convergence to a stationary point (to 
be defined in Appendix 4.6) does not require the uniqueness of •S'^"'^^^ 
in Step (a) which is a crucial assumption in the literature (see, for 
instance, Bertsekas (1999) and Tseng (2001)). 

3. Step (a) needs to solve a GLASSO optimization problem. To see this, 
denoting the standard vectorization operator by vec and and the Kro- 
necker product by 0, we rewrite \\YVk — XSk\\'jp/2 + A||5fc||2,i as 
||vec((yVfc)')-(X®4)vec(5^)|||/2+A||5fc||2,i. Although this subprob- 
lem is convex, finding its global minimum point can still be expensive 
for large data. Instead, one may perform some low-cost thresholding 
for a few steps. Concretely, let K he a constant satisfying K > ||X||2/2. 
Given V G OP""^, define Ty : W'' W'' as 



(11) TvoS = e(^-^X'YV + {I-^X'X)S-^^ ,VSg 



3Xfc 



where is a multivariate version of the soft-thresholding operator 
e. For any vector a G R'', 6(a; A) := a6(||a||2; A)/||a||2 for a 7^ 
and otherwise; for any matrix A G W^^'' with A = [ai • • • a^]', 
G(A;A) := [e(ai;A) ••• 6(0,,; A)]'. 

We now replace Step (a) in £/ by SPt^^ ^ ^i/O) ° • • • ° T U) o si^l, 
where the number of T u), denoted by aj, satisfies I < aj < Muer for 

some Miter < 00 specified based on available computational resources. 
aj need not be equal. This algorithm, denoted by s^' , offers more flex- 
ibility and is more convenient than in implementation. Although 
at each iteration S^^^^ is not uniquely determined, a stronger global 
convergence result holds for s^' . 



Theorem 7. Given A > and an arbitrary starting point V^^^ G O"^'^, 
let {S^k\iV^^j^) (j = 1,2,---J he the sequence of iterates generated by si' . 
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Then, any accumulation point of {S^^^,V^''^) is a coordinatewise minimum 

point (and a stationary point) of F and F{S^\iV^''\) converges monotoni- 
cally to F{Sl ^, Vk \) f'^^ some coordinatewise minimum point {SI ^, Vk x)- 



3.2. Simulation Studies. The setup of our simulations is as follows. 

• The design matrix X has i.i.d. rows Xj from a multivariate normal 
distribution MVN(0, S), with Ejk = p'^'"^', p> 0, 1 < j,k <p. 

• The coefficient matrix A has the form 



A = 







' bBoBi ' 


O 




O 



with 6>0, Bq a J X r matrix and Bi a r x n matrix. All entries in 
Bq and Bi are i.i.d. iV(0, 1). 

• The noise matrix E has independent A^(0, 1) entries. Let Ei denote its 
i-th row. 

• Each row 1^ in y is then generated as = X'^A + Ei, 1 < i < m. 

This setup contains many noisy features, but the relevant features lie in a 
low-dimensional subspace. This structure resembles many real world datasets; 
see our examples in Section 4, where the low rank structure is inherent and 
thus rank-constrained variable selection is desired. 
We report two settings: 

p > m setup: m = 30, \ J\ = 15, p = 100, n = 10, r = 2, p = 0.1, cr^ = 1, 
b = 0.5,1. 

m > p setup: m = 100, | J| = 15, p = 25, n = 25, r = 5, p = 0.1, cr^ = 1, 
b = 0.2,0.4. 

Although we performed experiments in many other settings, say, with p = 
0.5, we do not report all results, as the conclusions are similar. The current 
setups show that variable selection, without taking the rank information 
into consideration, may be sub-optimal even if the correlations between pre- 
dictors are low. 

We tested five methods: RSC, GLASSO, method 1 (RSC^RCGL), method 
2 (RCGL^JRRSl), and method 3 (GLASSO^RSC), as described in Sec- 
tion 2.2. To minimize the influence of various parameter tuning strategies on 
our performance comparison, we generated a large validation dataset (10,000 
observations) to tune the parameter of each algorithm (with the exception 
of method 2) and we also generated another independent dataset of the same 
size as test data to evaluate the test error. Similar to the LARS-OLS hybrid 
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(Efron et al., 2004), for each GLASSO and RCGL estimate, we computed 
the least squares estimate restricted to the selected dimensions. We found 
that the resulting (bias corrected) solution paths are more suitable for pa- 
rameter tuning. For method 2, after getting the (bias corrected) solution 
path, we set c = 3 and cr^ = 1 in (3) to select the optimal 13^'^^; in contrast 
to the other two methods, no validation data is used for tuning. 

Each model was simulated 50 times, and Tables 2 and 3 summarize our 
findings. We evaluated the prediction accuracy of each estimator A by the 
mean squared error (MSE) \\XA — XA\\p/(mn) using the test data at each 
run. Since the MSE histograms turned out to be highly asymmetric, we 
computed the 40% trimmed-mean of MSEs as the goodness of fit of the 
obtained model. This trimmed mean is more robust than the mean and more 
stable than the median, and it therefore allows for a more fair comparison 
between methods. 

We also report the median number of predictors (denoted by |J|) and 
median rank estimate (denoted by R) over all runs. Estimators with small 
MSE and low \ J\ and R are preferred from the point of view of statistical 
modeling. 

Finally, we provide the rates of non-included true variables (denoted by 
M for Misses), and the rates of incorrectly included variables (FA for False 
Alarms). Ideally, both rates are low, especially the M-rates, since we do not 
wish to discard relevant features. 

Table 2 

Performance comparisons between GLASSO, RSC, method 1, method 2 and 
method 3 in the p > m experiment with b = 0.5, 1, \ J\ = 15 and r = 2. 





MSE 


1^1 


R 


M 


FA 




b-- 


= 0.5 








GLASSO 


206 


10 


10 


53% 


4% 


RSC 


485 


100 


2 


0% 


100% 


method 1 


138 


19 


2 


36% 


10% 


method 2 


169 


21 


2 


45% 


7% 


method 3 


169 


10 


2 


53% 


4% 




b 


= 1 








GLASSO 


511 


14 


10 


41% 


7% 


RSC 


1905 


100 


2 


0% 


100% 


method 1 


363 


21 


2 


31% 


12% 


method 2 


434 


25 


2 


30% 


13% 


method 3 


402 


14 


2 


41% 


7% 



We can draw the following conclusions from Tables 2 and 3. 
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Table 3 

Performance comparisons between GLASSO, RSC, method 1, method 2 and 
method 3 in the m > p experiment with b = 0.2, 0.4, | J| = 15 and r = 5. 







\-J\ 


ti 


M 


r A 




b 


= 0.2 








GLASSO 


18.1 


14 


14 


4% 


1% 


RSC 


11.9 


25 


5 


0% 


100% 


method 1 


8.3 


15 


5 


0% 


1% 


method 2 


8.3 


15 


5 


0% 


4% 


method 3 


8.9 


14 


5 


4% 


1% 




b 


= 0.4 








GLASSO 


17.7 


15 


15 


0% 


0% 


RSC 


11.5 


25 


5 


0% 


100% 


method 1 


8.1 


15 


5 


0% 


0% 


method 2 


8.0 


15 


5 


0% 


1% 


method 3 


8.1 


15 


5 


0% 


0% 



• We see that straightforward variable selection via GLASSO often severely 
misses some true features in the p > m setup as seen from its high 
M numbers. RSC achieved good rank recovery, as expected, but, by 
the definition of this estimator, it uses all p variables. Clearly both 
GLASSO and RSC alone are inferior to the three JRRS-type methods 
{methods 1, 2 and 3). 

• Method 1 dominates all other methods. Its MSE results are impressive 
and confirm the rate improvement established in Section 2.2 over the 
GLASSO and RSC. While method 1 may not give exactly \ J\ = \J\ = 
15, its M numbers indicate that we did not miss many true features. 

• Method 2, unlike method 1 and method 3, did not use the large vali- 
dation data for ideal parameter tuning, which explains its slight infe- 
riority relative to the other two methods. However, we see that even 
without validation-based tuning, which may at times be infeasible in 
practice, this method is a serious contender. It supports the theoret- 
ical findings of Theorem 2 on the usage of the penalty (3) for model 
comparison and tuning parameter selection. 

• The performance of method 3 is inferior to that of method 1 in the 
p > m experiment, and comparable with both method 1 and method 
2 in the m > p experiment, when the three methods have essentially 
the same behavior. 

In conclusion, we found that method 1 is the clear winner in terms of per- 
formance as well as computational speed, among the two-stage JRRS proce- 



16 



dures we considered, and is particularly appealing in the m < p regime. In 
particular, it shows the advantage of the novel penalty type which enforces 
simultaneous (row) sparsity and rank reduction on the coefficient matrix. 
Method 2 using penalty (3) provides evidence of success of Theorem 2. 

4. Applications. In this section, we apply method. 1 with its tuning 
parameters chosen via cross-validation, to two real datasets from machine 
learning and cognitive neuroscience. 

Norwegian Paper Quality. These data were obtained from a controlled ex- 
periment that was carried out at a paper factory in Norway (Norske Skog, 
the world's second-largest producer of publication paper) to uncover the ef- 
fect of three control variables Xi , X2 , X3 on the quality of the paper which 
was measured by 13 response variables. Each of the control variables Xi 
takes values in { — 1,0,1}. To account for possible interactions and non- 
linear effects second order terms were added to the set of predictors, yield- 
ing Xi, X2, X3, Xl,X2,X^,Xi ■ X2,Xi ■ X3,X2 ■ X3 and the intercept term. 
There were 29 observations with no missing values made on all response 
and predictor variables. The Box-Behnken design of the experiment and the 
resulting data are described in Aldrin (1996) and Izenman (2008). Since nei- 
ther the group penalty nor the rank constraint is imposed on the intercept 
term, we always center the responses and standardize the predictors in the 
training data (and transform the validation/test data accordingly). 

The data set can be downloaded from the website of Izenman (2008) and 
its structure clearly indicates that dimension reduction is possible, making it 
a typical application for reduced rank regression methods. The RSC method 
with adaptive tuning, as described in Bunea, She and Wegkamp (2011), se- 
lected the rank r = 3. This finding is consistent with Aldrin (1996), who 
assessed the performance of the rank 3 estimator by leave-one-out cross- 
validation (LOOCV) and obtained a minimum LOOCV error (total squared 
error, unsealed) of 326.2. We then employed the newly developed method 1 
to automatically determine the useful predictors and pursue the optimal pro- 
jections. Not surprisingly, the selected rank is still 3, yielding 3 new scores, 
which are now constructed from only 6 of the original 9 predictors, with X^, 
Xi ■ X2 and X2 ■ A3 discarded, and only the variables from Table 4 selected. 
The tuning result was the same for 10-fold CV and LOOCV. The minimum 
LOOCV error is now 304.5. We found no interaction effect between X2 and 
(Ai, A3), an interesting complement to Aldrin's analysis. Table 4 shows the 
construction weights of the 3 new orthogonal score variables from the rank-3 
RSC on the selected set of variables. They are ordered by an importance 
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measure given by the associated eigenvalues of Y' X{X' X)^^ X'Y =: W (see 
Reinsel and Velu (1998) and Izenman (2008) for the explanation). For in- 
stance, the first important score variable (accounting for 57.5% of the trace 
of W) can be roughly read as 2Xi - 0.5^2 - X^ - 0.5X| + X| + X1X3, or 
simply 2X1 + X1X3 + lx3=-i — 1x2=1- This can be used as a concise sum- 
mary predictor for all 13 response variables simultaneously and it quantifies 
the effect of the design variables on paper quality control. 

Table 4 

Paper quality control: Joint new feature construction from the selected predictors with 
F= 3, |J| — 6. The extracted components are ordered by their associated eigenvalues. 



New Scores 


Eigenvalues 


Xl 


X2 


X3 


xi 


xi 


Xl ■ X3 


1 


112.4 


1.9244 


-0.5288 


-1.2321 


-0.4443 


1.3109 


1.1898 


2 


40.7 


0.8231 


-0.5937 


0.9324 


0.7819 


-0.1599 


-0.8536 


3 


24.9 


0.2871 


1.0336 


0.4215 


0.8365 


0.4245 


0.4677 



Cognitive Neuroimaging. We present an analysis of the data set described 
in Bunea et al. (2011) and collected to investigate the effect of the HIV- 
infection on human cognitive abilities. Neuro-cognitive performance is typi- 
cally measured via correlated neuro-cognitive indices (NCIs). This study em- 
ployed n = 13 NCIs, falling into five domains of attention/working memory, 
speed of information processing, psychomotor abilities, executive function, 
and learning and memory. These indices were measured for 62 HIV-I- patients 
in the study. The set of explanatory variables was large and contained: (a) 
clinical and demographic predictors and (b) brain volumetric and diffusion 
tensor imaging (DTI) derived measures of several white-matter regions of 
interest, such as fractional anisotropy, mean diffusivity, axial diffusivity, and 
radial diffusivity, along with all volumetrics x DTI interactions. We refer 
to Bunea et al. (2011) for details. The final model has p = 235 predictors, 
much greater than the sample size m = 62. An initial analysis of this data 
set was performed using the RSC to select a model of rank 1 and construct 
the corresponding new predictive score. Although this is a massive reduc- 
tion of the dimension of the predictor space, all 235 initial predictors were 
involved in the construction of the new score. 

This leaves unanswered the important question as to what variables (es- 
pecially which DTI derived measures) are most predictive of the neuro- 
cognitive changes in HIV-|- patients. After standardizing the predictors, we 
run method 1. 

We selected a model of rank 1 and constructed one new predictive score 
but, very importantly, this score is a linear combination of only 10 predictors 
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that were selected from the original pool of 235. 

When we set aside 30% of the data as a separate test set, and used the re- 
maining to fit the model and tune the regularization parameters, the mean 
squared error (MSE) of the RSC estimate was 192.9, while the MSE of 
the newly proposed method was only 138.4. Moreover, our analysis demon- 
strates not only the existence of a strong association between the variable 
Education and the neuro-cognitive abilities of HIV+ patients, which had 
already been established by other means in the literature, but also suggests, 
as a perhaps new finding, that the variable fractional anisotropy at corpus 
callosum (f a_ccl) stands out among the very many DTI-derived measures, 
in terms of predictive power. 

Appendix. For a generic index set / C {1, . . . ,p}, we define the m x p 
matrix Xj as follows: its ith column coincides with that oi X i £ I, oth- 
erwise we set the entire column to zero. Furthermore, we define Pj as the 
projection matrix on the column space of Xj. 

Since we favor transparent proofs, we did not attempt to optimize various 
numerical constants. 

4.1. Proof OF Theorem 1. By the definition of i?, for any pxn matrix 
B with r{B) > 1, the inequality 

\\Y - XB\\l + pen{B) < \\Y - XBfp + pen{B) 

holds. This is equivalent with 

\\XA-XBfp < \\XA- XBfp + 2];)en{B) 
(12) + ^2 < E,XB - XB > -pen{B) - pen(B)| . 

We consider two complementary cases: r{B) > 1 and r{B) = 0. 

Case 1: r[B) > 1. We write J = J{B) and J = J{B), and we note that 
J U J = Ji U J2 U Js with Ji = JTi J, J2 = J n J, J3 = JTi J. Hence, 

3 

<E,X{B-B)> = Y,<E,Xj,{B-B)> . 

k=l 

The penalty term in (3) can be written as 

pen(S) = ca^r{B){2n + f{\J{B)\)} 
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for the function f{x) = xlog(2e) +xlog(ep/x). Since f"{x) = —l/x, f{x) is 
concave for a; > 0, and we have 

1 1 62? CV 

f{x + y) > -f{2x) + -f{2y) = x + y + xlog{-^) + ylog{-^) 

2 2 X y 

for ah x,y > 0. Consequently, writing ri = r{B), r-^ = r{B) and r2 = ri + r^, 

pen(B) + pen(^) = caVi{2n + /(| Ji| + | J2I)} + cc72r3{2n + /(| J2I + | J3I)} 

3 



^Tfc |n+ |Jfc| + |Jfc|log(|^)| . 



This implies that 

|2 < E,XB - XB > -pen(S) - pen(5)| 

3 r . 

<J2 2<E,Xj^{B-B) > -catkin + \Jk\ + \Jk\log{^] 
k=i L l-^^-l 

We define 

R = max(^dl{PiE)-^a^n + \I\ + \I\log{^^)} 

In this proof^, we set c = 12. Using the inequality 

<E,Xj^{B-B)> = <Pj^E,Xj^{B-B)> 

< di{Pj,E)^k\\Xj,iB - B)\\f 

and the inequahties 2xy < x'^/a + y'^a and (x + y)^ < x^(l + a) + y^(l + a)/a 
for all X, y G M with a = 2, we further bound 



2 < E,XB - XB > -pen{B) - pen(S)| 



' 1 

< 



2 \\Xj^{B - B)\\i, + 2ndi{Pj,E) - ca'n <n + \Jk\ + \Jk\ ^ogijfj) 



k=l 

< -\\XB - XB\\l + 6(maxrfc)i? 

2 k 



< ^\\XB - XA\\l + -\\XB - XA\\l + 12ni?, 



^\\XB-XA\\l + -\ 

A careful inspection reveals tliat we may take any c > 3, at the cost of larger constants 
elsewhere. 
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using max(ri, r2, ra) = r2 < 2n. Now, (12) and the display above yield 



XB-XA\\l < ^\\XB - XA\\l + 2pen{B) + 12nE[R] 

< 2 11^^ ~ ^Mf + 2pen(B) + 192na^ exp(-n/2), 



using Lemma 8 in the last inequality. This concludes the first case. 

Case 2: r[B) = 0. Using the same reasoning as above, we can argue that 

\\XAfp = \\XA-XBfp 

< \\XB - XAfp + 2pen(S) - 2 < E,XB > -pen{B) 
5,, „ ,,,2 , „ , 3 2 



< -\\XB - XA\\l + 2pen(S) + -\\XAfp + R' 



with 



R' = 2r{B)dl{Pj^B)E)-pen{B) 

< 2r{B) {dl{Pj^B)E) - ^a\n + | J(5)|)} 

By Lemma 3 in Bunea, She and Wegkamp (2011), we have E[(if (Pj(s)-£')] < 
2cj2(n + \ J{B)\), so that E[i?'] < for our choice c = 12 above. Hence 



1 



XA-XBfp < -\\XB - XA\\l + 2peii{B), 



2' 

which concludes the second case, and our proof. □ 



Lemma 8. We have 



E 



ma^ UfiPjE) - 6aUn + \ J\ + \ J\log I -j 



< 16a2exp(-n/2). 



Proof. Notice that d\{PjE) has the same distribution as d1{Zk) for a 
mxk matrix of independent A^(0, o"^) entries with k = \ J\. Consequently, 
for any i^fc > 0, 



E 



ma^(^dj{PjE)-a^i.fj^^ < p^(^^E[{dl{Zk)-aW) 



We write Vk = di{Zi^)/a and /i^ = -y/n + Vk, ^ ^ k < p. From Lemma 3 in 
Bunea, She and Wegkamp (2011), we have, 

^{Vk -lik>t}< exp(-tV2) 
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for all t > 0. Hence, for i^k = fJ-k + and 5k > 0, we obtain 
E[V,^-,^l]^ < E[vil{V,^>ul}] 

< I 2x¥{Vk > x} dx 
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< 



poo 

/ 2(^fc + s)exp(-s^/2)ds 
J&k 

= 2(l + ^)exp(-5fcV2) 
for 1 < A; < p. Choosing 5^ = 2k log{ep/k) + n + k, we find 



E 



max(d?(P,i^)-a2.f,|) < M!2(1 + ^) exp(-5^/2) 



k=l 

q 



6k 



k=l 
9 



6cj2 exp{-k/2) exp(-n/2) 



k=l 



< 16cr^exp(-n/2). 
Since z/| < 2(5| + 2/i| < 4:klog{ep/k) + 6k + 6n, the claim follows. 



□ 



4.2. Proof of Theorem 2. By the same reasoning as in the proof of 
Theorem 1, we obtain 

< ^\\XBj - XAfp + 2pen{Bj) 

+12nmax {dl{PiE) - |cj2(n + |/| + |/| log ■^)) 

for any random or not, with r{Bj) > 0. Taking expectations on both 
sides of the inequality and applying Lemma 8 gives the result. □ 



\XB 



XA\\l 



4.3. Proof of Theorem 3. We denote the row vectors of the matrix 
-Bfc by 6i, . . . , bp, and we write Jk = {i '■ hi ^ 0} for the index set of non-zero 
rows. Let B be any matrix with row vectors bi, . . . ,bp G with r{B) < k 
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and J = {j ■ bj / 0} such that S = X'X/m satisfies condition 2l(J, (5j). 
We use X to denote Xj with J = J^L) J and we write = \\XB — XA\\p 
and = — In this notation, we have by the definition of 

A^ + 2A||i^fc||2,i < A^ + 2<E,X{Bk-B)>+2X\\B\\2,i. 
This implies that 

(13) Al + 2X\\{B-Bk)j42,i 

<A^ + 2<E, X{Bk -B)> +2\\\{B - Bk)j\\2,i. 

For the second term on the right of (13), we note that 

2<E,X{Bk-B)> = 2 < PjE,X{Bk- B) > 

< 2di{PjE)V2k\\X{Bk- B)\\f 

< 2di{PjE)V2k{Ak + A) 

< 16kdl{PjE) + ^{A'^ + A'^) 

using r{Bk — B) < r{Bk) + r{B) < 2k and the inequahty 2xy < Ax^ + ?/^/4 
for ah G M. This bound and inequahty (13) give the inequality 

(14) \Al + 2\\\{Bk-B)j.h,i 

< ^A2 + IGkdjiPjE) + 2X\\{Bk - B)j\\2,i. 
For the remainder of the proof, we consider two complementary cases. 
Case 1. Assume that 

(15) ^A2 + IQkdliPjE) < 2\\\{Bk - B)j\\2,i. 
In this case, (14) implies that 

||(Sfc-5)jc||2,i < 2||(5fe-5)j||2,l. 

Since S satisfies 2t(J, (5j), we have 

\\XBk-XB\\l > m6jY,\\bi -hWl 
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This inequality and the inequahty 2xy < ja + y^a apphed to a = 8/(5j, 
give 



2A||(S-4)j||2,i = 2A^||6i-6i||2 

< —)?\j\^ — y\\hi-h. 

m n ^ — ^ 



2 

, ^ .. 'ill2 

m a ■' — ' 



(16) < -X'^\J\ + -^\\XBk-XB\\l 

m ad J 



< ^AVl + ^(Ai + A') 
m ' ' a6,j ^ 

8A^|e7| 1 .'T-n . n, 

= — P + jAi + A^. 

moj 4 

After we combine (14), (15) and (16), we obtain 

\Al < 4X\m-B)jh,, < ^^ + ^(Af. + A^ 

whence 

(17) Al < 2A2 + ^1^ = 2A2 + 64CA:aV|log(ep)^ 
for the choice = CAi(S)mA;(7^ log(ep). This concludes the first case. 

Case 2. Assume that 

(18) ^A2 + IGkdliPjE) > 2X\\{Bk - B)j\\2,i. 
In this case, (14) now gives 

(19) 3A^ < 10 + 6Akdl{PjE). 
By Lemma 8, we have 



ij| + ij|iog(^: 



E[dj{PjE)] < Qa'^n + 16cj2 exp(-7V2) + efj^E 
(20) < Qa'^n + 16a^ exp(-n/2) + 120"^ log(ep) ( | J| + E 



\Jk\ 

We now bound E[| Jfc|]. Since r{Bk) < we may write B^ = SV^ for some 
Vk with k columns satisfying V^Vk = /fcxfc- Following the lines of argument 
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in the proof of Lemma 9, with V fixed at 14, S is the (globahy) optimal 
solution to the convex problem of (25). Let Xi and Sj for 1 < i < p, he the 
column vectors of X and S' , respectively. Using the Karush-Kuhn- Tucker 
condition of 5', we obtain 

(21) \\%\\^0^\\si\\^0^\\x[{XS-YVk)\\2 = X, 
so that 

\Jk\X^ = Y.\\x'^{XS-XAVk-EVk)\\l 

= \WMS-XAVk - Pj^EVk)\\l 

< 2Y,\Wj{XS-XAVk)\\l + 2Y,\WjPj^EVk)\q 

< 2\i{X'X)\\XBk - XA\\l + 2\i{X'X)\\Pj^EVk\\l 

< 2mAi(S)|A| + fc(i?(Pj^^)} 

using riVk) ^ ^- Taking expectations on both sides, 

(22) < ''-^^^^[ml]+mi{PjE)]] 

since C J. After we combine (20) and (22), we get 

E [dl{PjE)\ < Ga'^n + 16a^ exp(-n/2) + 120-^ log(ep)| J| 

+2Aa' log(ep)^^^ [e[AI] + kE[dl{PjE)]} . 

Taking = C log{ep)mka'^ Xi{T.) with C = 792, we find 

64fc E[dl{PjE)] < 66k [6a^n + 16a^ exp(-n/2) + 12a^ log(ep)| J|] + 2E[A^]. 
Inserting this bound in (19), we obtain 

(23) E[A|] < lOA^ + 66A: [6a^n + 16a^ exp(-n/2) + 12(7^ log(ep)| J|] . 

This concludes the second case. Our risk bound (8) follows directly from 
(17) and (23). □ 



JOINT VARIABLE AND RANK REDUCTION 



25 



4.4. Proof of Theorem 4. Recall that f is the number of eigen- values 
of Y'PY that exceed the threshold level 2fi = 4(T^(-v/n + y^)^. Theorem 2 
and Corollary 4 in Bunea, She and Wegkamp (2011) show that for cq = 
(V2- 1)2/2, 

P{f / r} < exp{-co{n + q)}. 
Next, we decompose the risk as follows: 



(24) E 



\XA - XB'^^Mll 



E \\\XA- XB^^^\\ll{?= r} 



+ E 



\XA- XB^^^\\ll{?y^r} 



The first term on the right gives the bound obtained in Theorem 3 for k = r. 
It remains to bound the second term. 



Let O denote the p x n matrix with all entries equal to zero. Then, since 
r = r{B) > r{0) = 0, we have the inequality 

||y - XB^^^ll + 2A||sW||2,i < ||y - XOfp + 2A||0||2,i = \\Y\\l 

by the minimzing property of ]3^^\ Using Pythagoras, \\PY — XIB^^^\'jp < 
||Py 11^ for P the projection matrix on the column space of X. Consequently, 

E \\\XA - XB^^'^\\ll{?^ r} 

< 2E |||Py - XA\\l + \\PY - XB^^^\\l^I{?^r} 

< 2E [{ \\PEfp + \\PY\\l} I{? / r}] 

< 6E [||P£;|||,/{f / r}] +4\\XA\\lF{?j^ r}. 

Since ||PS|||,/(t2 has a Chi-square distribution with qn degrees of freedom 
(see Lemma 3 in Bunea, She and Wegkamp (2011)), we have 

E < {q'^n^ + 2gn)cj^ < 2q'^n^a^. 

We obtain, using the Cauchy-Schwarz inequality, 

E[\\PEfpI{?^r}] < V2qna^ exp{-co{n + q)/2}, 



and so 



E 



\XA- XB^^^fpI{?y^r} 



< 4||XA|||.exp{-co(n + g)} + 6V2qna'^ exp{-co(n + g)/2} 

The second term on the right is clearly bounded. For the first term on the 
right, we invoke condition (ti on ||X74|||.. This completes our proof. □ 
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4.5. Proof of Theorem 5. In this proof, Ci,...,Cj are numerical, 
positive and finite constants. Theorem 2 of Section 2, apphed to the random 
matrices i?A,fci yields 



< minE 



l^\\XBx,k - XAfp + 8pen(SA,fc) + TeSna^ exp(-n/2) 



< CiminE 

A 



\\XBx,r - XA\\l + pen(SA,r) + na"^ 



(Here we used c = 12 in the penalty term.) Since S satisfies 2l(J(A), 

and we assume that Ai(S)/5j(^) < C3, Theorem 3 yields, for each global 

solution -Ba.d 

n\\XBx,r-XA\\l] < C2{n + \J{A)\\og{p)}ra\ 

for A'^ = C a"^ \i{Tj)km\og{ep) with C large enough. It remains to bound the 
expected penalty term E[pen(i?A,r)]- Since 



E[pen(5A,.)] < Ci{n + \og{p)¥.[\J{B, 



A,rJ 



we need to bound E[| J(i?A,r)|]- We write Jr = J{Bx^r)- From (22) in the 
proof of Theorem 3, we have 

nJr\] < ^"^^i(^) {E[||xgA.r - ^^11^] + rndjiPj^E)]} , 
while Lemma 8 gives 

E[dl{Pj^E) < GncT^ + IGci^ exp(-n/2) + 12a^ log(ep)E[| J|]. 

Therefore, taking = Ca^Xi{T,)kmlog{ep) with C large enough, we obtain 
from the previous three displays 

E[|X|] < C5{mXBx,r-XA\\l]/r + nayiog{p)] 
< Cea^n+\J{A)\logip)} 

We now can conclude that 

E[\\XB^^^ - XA\\l] < C7a\{n + log{p)\J{A)\} , 

and the proof is complete. □ 
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4.6. Proof of Theorem 6. In this proof, we drop the subscripts X,k 
in the {S, V) iterates. 

Proof of Part (ii) The proof of this part of Theorem 6 will follow from 
the Global Convergence Theorem (GCT) of Zangwill (1969). For complete- 
ness, we state this theorem below, then we verify that its conditions hold in 
Lemmas 9, 10 and 11. 

The Global Convergence Theorem (Luenberger and Ye, 2008). Let 
A be a map describing an algorithm on X and suppose that given xq the 
sequence {xk^'^^i is generated by x^+i G A^x^)- Let a solution setT C X be 
given, and suppose: 

1. All points Xfc are contained in a compact set S C X ; 

2. There exists a continuous function Z on X such that (a) if x ^ T, then 
Z{y) < Z{x) for all y G A{x) (b) if x £ T , then Z{y) < Z{x) for all 
y G A{x); 

3. The mapping A is closed at points outside T. 

Then the limit of any convergent subsequence of {xk}k is a solution. 

We begin by introducing a map, usually referred to in the literature as a 
point-to-set map, to characterize our algorithm . Let O = M^^'^ x O"^'^. 
Define : ^ ^ 2^, A^^ : ^ 2^^ as follows 

M^{S,V) = {{S,V)e^: inf F {S ,V) > F {S ,V)] , 
M^{S, V) = {(S, V)e^: inf V) > F{S, V)}, 

and define A4 = Ai^AA^ as a composite point-to-set map, see Luenberger and Ye 
(2008) for more details. Algorithm £/ can be described by 

that is, (S(J'+i),Wj')) G 7W^(5(j'),T/(j'))and (S^^'+i), W^'+i)) G {S^^+'^\V^^^). 
Recall that 

F{S,V;X) = ^\\Y - XSV'Wl + X\\SV'\\2,i, S G F g 0"^^ 

and that we analyze the unconstrained minimum of F over the product 
manifold W^^ x O"^*'. For simphcity, we will write just F{B) and F{S, V) 
for F{B; X) and F{S, V; A), respectively, when there is no ambiguity. 

Lemma 9 shows the algorithm converges globally for any initial starting 
point. 
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Lemma 9. For any j > 0, rank{B^^^) < k and F{B^^^) > F{B'-^+^^). 

Proof. We write F{S,V) = l\\YV - XS\\l + A||5||2,i + |tr(y(/ - 
VV')Y'). Given V, (10) reduces to the following optimization problem after 
vectorization: 

(25) i||vec((yy)') - (X ® /)vec(5')||| + A||5||2,i, 

where 'vec' is the standard vectorization operator, and (8) is the Kronecker 
product. This is a GLASSO-type optimization problem that is convex in 
S. The global minimum of (25) can always be achieved at some S with 
ll^ll < oo. Given S, writing F(5, F) = -tT(Y'XSV') + \\Y\\l/2+\\XS\\l/2+ 
A II 5*11 2,1, we see the optimization problem is equivalent to 

(26) max tr{W'V), 

where W = Y'XS G M"^^ The (global) maximum can be attained, too, 
due to the compactness of O^'^'p. In fact, tr{W'V) < J2idi(W) by von 
Neumann's trace inequality. Let W = UwDyjV^ be the SVD with Ki G O''^''. 
Then 

(27) V = [/^K, 

achieves the upper bound '^idi{W). This globally optimal solution to (26) 
is the one used in Algorithm £/. Therefore, we have 

that is, F{b[^1) > F{bI^'^^^) during each iteration. □ 

Lemma 10. Suppose A > 0. Then B^^\ S^^\ V^^^ in £/ are uniformly 
bounded in j . 

Proof. From Lemma 9, ||S(^')||2,i < F{S^^\V^^'^)/ \ < F{S^^\V^^'>)/ \ < 
F(0,y(°))/A < ||y|||,/(2A). Therefore, ||5(^)|| must be uniformly bounded. 

□ 

Lemma 11. Suppose A > 0. The map A4 introduced for describing £/ is 
a closed point-to-set map on 



Proof. Notice that 
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• The set Q = W^^ X O"'^^ is closed because O"^'^ is the inverse image 
of {/} under the continuous function M ^ M'^M, VM G M"^''; O"^^ 
is in fact an embedded submanifold of M"^*^. 

• M^{uj) 7^ 0,7W^(a;) / 0, Vw G 0, seen from the proof of Lemma 9. 

First, we prove that TW"^ is closed on Q. It suffices to show the point-to- 
set map Ai'^{x) = {y G MP^^ : F{y,x) < miiiy^^pxk F{y,x)} is closed at 
any x G O"^'^. Let xj — > x*, yj G M^{xj) and yj — ^ y*, with Xj,x* G 
0''''"',yj,y* G W^. Suppose y* ^ M'{x*). Since y* G RP'"', F{y*,x*) > 
miny^^pxk F{y,x*) =: L. There exists some eo > such that F{y*,x*) > 
L + Eq. Let y G M'^ix*). Then F{y,x*) = L. Since F is continuous at {y.,x*), 
liuij^oo F{y, Xj) = L. For j large enough, \F{y,Xj) — L\ < eo/2, from which 
it follows that F{yj,Xj) < F{y,Xj) < L + eo/2 and F{y*,x*) < L + eo/2. 

The contradiction implies y* G A4^{x*). Hence and thus A4^ are 
closed. From the proof of Lemma 10, we also know is compact. Similarly, 
we can show is closed on M^^*^. Based on the properties of the point- 
to-set maps (Luenberger and Ye, 2008, p. 205), ^A is closed on Q. □ 

If we set Z in the general statement of the Global Convergence Theorem 
to be our continuous criterion function F, and if we take T to be the set 
of local of minima of F, Lemma 9 and the assumption in Part (ii) of our 
Theorem 7 guarantee that (2) of GOT holds. Lemmas 10 and 11 verify con- 
ditions (1) and (3) of GCT, respectively . This concludes the proof of this 
part. □ 

Proof of Part (i) From displays (25) and (26) we observe that F is convex 
in S, given V, and it is linear and therefore smooth in V, given S. This part 
of the theorem shows that, under no further conditions on F, algorithm ^ 
converges to a stationary point of F. We begin by defining a stationary point 
in this context. Recall that we view (10) as an unconstrained optimization 
problem in = M^^'^ x O"^'^. Notice that O"^'^, which is a Stiefel manifold, 
is a Riemannian submanifold of M"^'^. We use the inherited Riemannian 
metric to define the gradient of F with respect to V (Boothby, 2002). This 
Riemannian gradient, denoted by VvF{S,V), can be explicitly computed: 
VvF{S,V) = Vvi-W) = -W + VV'W/2 + VW'V/2 with W = Y'XS, 
where Vy is the projection onto the tangent space to O"^*^ at V . 

Since F is convex in S, the sub differential of F with respect to S, denoted 
by dsF, is also well defined. From Gabay (1982) and Shimizu, Ishizuka and Bard 
(1997, p. 62), a necessary condition for F to have a local minimum at 
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{S* ,V*) is that this point is stationary, that is, 

(28) e dsF{S*,V*), and VvF{S* ,V*) = 0. 

For future use, note that VvF{S, V) is continuous on Q. Since 
minimizes F(5(J+1),-), we have VvF{S^^+'^\V^^+^^) = 0, for any j > 0. 

Because the optimum of (25) (or (26)) may not be uniquely attained, F 
is not guaranteed to be a descent function in general (see Zangwill (1969), 
Bertsekas (1999) and Luenberger and Ye (2008) for details). Therefore, GCD 
cannot be directly applied. 

From Lemma 9, F(5'(j), F^^^) must converge. Denote the limit by L* . Let 
(50i), yOi)) (Z = 1, 2, • • • ) be a subsequence of {S^^\V^^'^) which converges 
to {S*,V*) as / ^ oo. Then L* = limi^oo F{S^^'\V^^'^) = F{S*,V*). We 
assume, without loss of generality, (^S^^'~^^\V^^^^^^) also converges (Lemma 
10), and denote the limit by (5, V). We have 

(29) F{S,V) = L*. 

We claim that {S*,V*) must be a stationary point of F. First, the con- 
tinuity of VvF{S,V) and the fact that VvF{S^^\v'^^^) = imply that 
VvfIs*,V*) = 0. Suppose ^ dsF{S*,V*). This impli_es_5* is not a global 
minimizer of (25). Lemma 11, however, states that {S,V) G M{S* ,V*). 
That is, L* = F{S,V) < F{S,V*) < F{S*,V*) = L*, which contradicts 
(29). The last inequality is strict because F is convex in S and so, applying 
the algorithm to S*, which is not a global minimizer, yields a strict improve- 
ment (decrease) of the criterion function. Hence G dsF{S*, V*). The proof 
is complete. □ 



4.7. Proof of Theorem 7. We use the same notation system as in 
Appendix 4.6. Recall that (S*, V*) € is a coordinatewise minimum point 
of F{S, V) if F{S, V*) > F{S*, V*), V5 e and F{S*, V) > F{S*, V*), 

\/V G O"^^ (Tseng, 2001). In our problem, this implies {S*,V*) is also a 
stationary point of F{S, V). 

Without loss of generality, assume X has been scaled to have ||X||2 < 1 
before running Algorithm For simplicity, set ii' = 1 in (11) and redefine 
the operator Ty : W""" W"" by 

(30) TyoS = Q{X'YV+{I -X'X)S;\) ,\fS eW'"'. 

Let Ty (a G N) be the composition of a Ty's. Define point-to-set maps 
M^{S,V) = {{S,V) e n-. S e {Tyo S,T^ oS,--- ,T^'*='- o S},V = V}, 
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and M^{S, V) = {{S, V) £ n : mfy^^„^,, F{S, V) > F{S, V), S = S}. Then 
A4 = M'^ characterizes When updating S at Step (a), the algorithm 
allows one to perform T any times (denoted by aj) provided aj does not go 
beyond Mner G N that is prespecified before running £/' . 

Lemma 12. Given any V G O"^'' and S G W"", let S = Tv{S). Then 
F{S,V) - F{S,V) > ^\\S- S\\l. 

Proof. Apply the theorem in She (2012) to the vectorized problem (25). 
Note that 2 — ||X <^ I\\2 = 2 — ||^||2 > 1- The proof details are omitted. □ 

Choose {S, V) G A4^{S, V), using the triangle inequality we know F{S, V) — 
F{S,V) > ^\\S-S\\l. 

Lemma 13. Suppose A > 0. Then B^^\ S^^\ V^^^ in £/' are uniformly 
bounded in j. 

The proof is similar to the proof of Lemma 10 and therefore omitted. 



Lemma 14. Suppose A > 0. The M for describing £/' is a closed point- 
to-set map on il.. 

Proof. Similar to the proof of Lemma 11, we prove that the point-to-set 
map M'{S, V) = {TvoS,T^oS,-- - , Ty'"^^ o S} is closed at any (5, V) e fl. 
Then and thus A4 are closed on Q. 

Let {Sj,Vj) {S*,V*), Sj e M'{Sj,Vj) and Sj S*, with {Sj,Vj), 
iS*,V*) G n and Sj,S* G Rf^^ There must exist infinitely many S/s 
satisfying T° o {Sj, Vj) = Sj for some a G N. Let g{S, V) = T^{S, V). It is 

not difficult to see that g is jointly continuous. Hence S* G 7W^(5*,y*) by 
a subsequence argument. □ 

Now we prove Theorem 7. Following the lines of the proof of Part (i) of 
Theorem 6, for any accumulation point {S*,V*) of {S^^\V^^'^), {S*,V*) G 
n and_ there exists {S,V) G M{S*,V*) with F{S,V) = F{S*,V*). Since 
F{S,V) < F{S,V*) < F{S*,V*), F{S,V*) = F{S*,V*). It follows from 
the comment after Lemma 12 that S = S* . This means Tyi o S* = S* for 
some ao G N. But then F{T^, o S*, V*) = F{S*,V*) for any a < ao, and in 
particular, F{Tv' o S*,V*) = F{S*,V*). Applying Lemma 12 again yields 
Tv* o 5* = 5*. It is easy to verify from (30) that S* is a fixed point of 
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Ty is equivalent to G dsF{S*, V*). Therefore, S* is a global minimizer of 
F{S,V*) given V*, due to the convexity of (25). 

On the other hand, from S = S*, we have {S*,V) e_M^{S*,V*). V 
is a (global) minimizer of F{S*,V) given 5*. But F{S*,V) = F{S,V) = 
F(5*,y*), so also minimizes F{S*,V) given 5*, and VvF{S*,V*) = 0. 
In summary, (S*, V*) is a coordinatewise minimum of F. □ 



References. 

Aldrin, M. (1996). Moderate projection pursuit regression for multivariate response data. 

Comput. Stat. Data Anal. 21 501-531. 
Anderson, T. W. (1951). Estimating linear restrictions on regression coefficients for 

multivariate normal distributions. Annals of Mathematical Statistics 22 327-351. 
Bertsekas, D. (1999). Nonlinear Programming. Athena Scientific. 

BiCKEL, P. J., RiTOV, Y. and Tsybakov, A. B. (2008). Simultaneous analysis of Lasso 
and Dantzig selector. Annals of Statistics. 37 1705-1732. 

BOOTHBY, W. M. (2002). An Introduction to Differentiable Manifolds and Riemannian 
Geometry, Revised, Volume 120, Second Edition (Pure and Applied Mathematics). Aca- 
demic Press. 

Brillinger, D. R. (1981). Time Series: Data Analysis and Theory, Expanded edition ed. 
Holden-Day, San Francisco, CA. 

BijHLMANN, p. and VAN DE Geer, S. (2011). Statistics for High- Dimensional Data: Meth- 
ods, Theory and Applications. Springer Series in Statistics. 

BuNEA, F. (2008). Honest variable selection in linear and logistic regression models via 
l-i and li + 12 penalization. Electronic Journal of Statistics 2 1153 - 1194. 

BuNEA, F., She, Y. and Wegkamp, M. (2011). Optimal selection of reduced rank esti- 
mators of high-dimensional matrices. Annals of Statistics 39 1282-1309. 

BuNEA, F., She, Y., Ombao, H., Gongvatana, A., Devlin, K. and Cohen, R. (2011). 
Penalized Least Squares Regression Methods and Applications to Neuroimaging. Neu- 
rolmage 55 1519 - 1527. 

Candes, E. J. and Plan, Y. (2010). Tight oracle bounds for low-rank matrix recovery 
from a minimal number of random measurements. IEEE Transactions on Information 
Theory 57(4) 2342-2359. 

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least Angle Regres- 
sion. Annals of Statistics 32 407-499. 

Gabay, D. (1982). Minimizing a differentiable function over a differential manifold. Jour- 
nal of Optimization Theory and Applications 37 177-219. 

Giraud, C. (2011). Low rank multivariate regression. Electronic Journal of Statistics 5 
775 - 799. 

IZENMAN, A. J. (2008). Modern Multivariate. Statistical Techniques: Regression, Classifi- 
cation and Manifold Learning. Springer, New York. 

KOLTCHINSKH, V., LouNici, K. and Tsybakov, A. B. (2011). Nuclear-norm penalization 
and optimal rates for noisy low-rank completion. Annals of Statistics 39 2302-2329 

LouNici, K., PONTIL, M., Tsybakov, A. B. and van de Geer, S. (2010). Oracle Inequal- 
ities and Optimal Inference under Group Sparsity. Annals of Statistics 39 2164-2204 

Luenberger, D. and Ye, Y. (2008). Linear and Nonlinear Programming, third ed. 
Springer. 



JOINT VARIABLE AND RANK REDUCTION 



33 



Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices 
with noise and high-dimensional scaling. Annals of Statistics 39 1069-1097. 

Reinsel, G. C. and Velu, R. P. (1998). Multivariate Reduced- Rank Regression: Theory 
and Applications. Springer, New York. 

ROHDE, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank ma- 
trices. Annals of Statistics 39 887-930. 

She, Y. (2012). An Iterative Algorithm for Fitting Nonconvex Penalized Generalized 
Linear Models with Grouped Predictors. Computational Statistics & Data Analysis. 56 
2976-2990. 

Shimizu, K., Ishizuka, Y. and Bard, J. F. (1997). Nondifferentiable and Two-Level 

Mathematical Programming. Kluwer Academic Publishers. 
Tseng, P. (2001). Convergence of Block Coordinate Descent Method for Nondifferentiable 

Minimization. J. Optim. Theory Appl. 109 475-494. 
Wei, F. and Huang, J. (2010). Consistent group selection in high-dimensional linear 

regression. Bernoulli 16 1369-1384. 
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of 

Statistics 11 95-103. 

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped 
variables. Journal of the Royal Statistical Society, Series B 68 49-67. 

Yuan, M., Ekici, A., Lu, Z. and Monteiro, R. (2007). Dimension Reduction and Co- 
efficient Estimation in Multivariate Linear Regression. Journal of the Royal Statistical 
Society, Series B 69 329-346. 

Zangwill, W. I. (1969). Nonlinear programming: a unified approach. Prentice-Hall in- 
ternational series m management. Prentice-Hall. 

Department of Statistical Science Department of Statistics 

Cornell University Florida State University 

E-MAIL: fb238@corncll.cdu E-MAIL: ysheOstat. fsu.edu 

Department of Mathematics & Department of Statistical Science 

Cornell University 

E-mail : marten . wegkamptQ! Cornell . edu 



